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Abstract 

The ability of the site-frequency spectrum (SFS) to reflect the particularities of 
gene genealogies exhibiting multiple mergers of ancestral lines as opposed to those 
obtained in the presence of exponential population growth is our focus. An excess 
of singletons is a well-known characteristic of both population growth and multiple 
mergers. Other aspects of the SFS, in particular the weight of the right tail, are, 
however, affected in specific ways by the two model classes. Using minimum-distance 
statistics, and an approximate likelihood method, our estimates of statistical power 
indicate that exponential growth can indeed be distinguished from multiple merger 
coalescents, even for moderate sample size, if the number of segregating sites is high 
enough. Additionally, we use a normalised version of the SFS as a summary statistic in 
an approximate bayesian computation (ABC) approach to distinguish multiple mergers 
from exponential population growth. The ABC approach gives further positive evidence 
as to the general eligibility of the SFS to distinguish between the different histories, but 
also reveals that suitable weighing of parts of the SFS can improve the distinction ability. 
The important issue of the difference in timescales between different coalescent processes 
(and their implications for the scaling of mutation parameters) is also discussed. 
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Introduction 

The site-frequency spectrum (SFS) at a given locus is one of the most important and popular 
statistics based on genetic data sampled from a natural population. In combination with the 
postulation of the assumptions of the infinitely-many sites mutation model (Watterson, 
1975) and a suitable underlying coalescent framework, the SFS allows one to draw infer- 
ence about evolutionary parameters, such as coalescent parameters associated with multiple- 
merger coalescents or population growth models. 

The Kingman coalescent, developed by Kingman (1982a,b,c), Hudson (1983a,b) and 
Tajima (1983), describing the random ancestral relations among DNA sequences drawn 
from natural populations, is a prominent and widely-used coalescent model from which one 
can make predictions about genetic diversity. Many quantities of interest, such as the ex- 
pected values and covariances of the SFS, are easily computed (Fu, 1995) from the Kingman 
coalescent. Its robustness is quite remarkable, and indeed a large number of models can be 
shown to have the Kingman coalescent, or a variant thereof, as their limit process, cf. e.g. 
Mohle (1998). A large volume of work is thus devoted to inference methods based on the 
Kingman coalescent; see e.g. Donnelly and Tavare (1995), Hudson (1990), Nordborg 
(2001), Hein et al. (2005) or Wakeley (2007) for reviews. 

However, many evolutionary histories can lead to significant deviations from the Kingman 
coalescent model. Such deviations can be detected using a variety of statistical tools, such as 
Tajima's D (Tajima, 1989a), Fu and Li's D (Fu and Li, 1993) or Fay and Wu's H (Fay and 
Wu, 2000), which are all functions of the SFS. However, they do not always allow to identify 
the actual evolutionary mechanisms leading to such deviations. Developing statistical tools 
that allow to distinguish between different evolutionary histories is, therefore, of fundamental 
importance. 

The present work focuses on properties of the (folded and unfolded) SFS in the infinitely- 
many sites model for three population histories: (1) classical Kingman coalescent, (2) popu- 
lation growth, in particular exponential population growth, and (3) high fecundity coupled 
with skewed offspring distributions (HFSOD), resulting in gene genealogies being described 
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by so-called Lambda-coalescents (Sagitov, 1999; Pitman, 1999; Donnelly and Kurtz, 
1999). Briefly, multiple merger coalescents may be more appropriate for organisms exhibiting 
HFSOD than the Kingman coalescent (cf. eg. Beckenbach, 1994; Arnason, 2004; Eldon 
and Wakeley, 2006; Sargsyan and Wakeley, 2008; Hedgecock and Pudovkin, 2011), 
see also a recent review by Tellier and Lemaire (2014). 

Both recent population growth as well as multiple-merger coalescents may lead to an 
excess of singletons in the SFS compared to the classical Kingman coalescent based SFS, 
which e.g. contributes to shifting Tajima's D values to the negative. Indeed, Durrett and 
Schweinsberg (2005) prove that Tajima (1989b) 's D will be negative, at least for large 
sample size, under fairly general multiple-merger coalescents. 

The associated genealogical trees are, however, qualitatively different. While moderate 
fluctuations in population size lead to a time-change of the Kingman coalescent (Kaj and 
Krone, 2003), multiple merger coalescents by definition change the topology of the ge- 
nealogical tree. There is thus hope that each demographic effect leaves specific signatures 
in the resulting SFS, not only with respect to an excess of singletons, but for example also 
with respect to its right tail. 

Indeed, one observes that the Kingman coalescent will not be a good match to genetic 
data containing many singleton polymorphisms due to a lack of free (coalescent) parameters, 
as opposed to multiple merger and population growth models, which both can predict an 
excess of singletons. Encouragingly, multiple merger and growth models exhibit noticeable 
differences in the bulk of the site-frequency spectrum, in particular in the lumped tail (Fig- 
ure 1). In Figure 1, the normalised expected spectrum y?^' n (2) for a given coalescent II, ie. 
the expected spectrum scaled by the expected total tree length, is compared for a particular 
multiple-merger coalescent (B; Schweinsberg, 2003), and exponential (E) growth models 
for sample size (number of leaves) n as shown. The first five classes (representing relative 
length of external branches, two-leaf branches, etc.) are shown, with classes from six on- 
wards collected together (labelled as '6+'). In Figure 1, the relative external branch lengths 
were matched between the different coalescent processes. Even though the relative exter- 



5 



Downloaded from http://biorxiv.org/on September 18, 2014 



nal branch lengths, and by implication the relative number of singletons, can be matched 
between the different processes, the lumped tail (group 6+ in Figure 1) differs between the 
multiple-merger coalescent (B), and exponential growth (E). 

Matching the relative external branch lengths (p^' n (2) and observing how the rest of 
the normalised spectrum behaves, as illustrated in Figure 1, gives hope that multiple merger 
processes may be distinguished from (at least) particular population growth models with 
adequate statistical power. The quantity (p^' n varies as a function of n and the associated 
coalescent parameter (Figure SI). In the limit of large n, for the Kingman coalescent, 
^ K = 0(l/logW). 

Inference methods for distinguishing population growth from the usual Kingman coales- 
cent have been extensively studied, see e.g. Tajima (1989a), Slatkin and Hudson (1991), 
Rogers and Harpending (1992), Kaj and Krone (2003), Ramos-Onsins and Rozas 
(2002), and Sano and Tachida (2005). Detecting multiple merger coalescents in popu- 
lations deviating from the Kingman coalescent assumptions is a relatively new direction 
of research. Indeed, deriving inference methods based on multiple merger coalescents has 
only just begun (Eldon and Wakeley, 2006; Birkner and Blath, 2008; Eldon, 2011; 
Birkner et al, 2011, 2013a,b; Steinrucken et al, 2013; Koskela et al, 2013). In par- 
ticular, Birkner et al. (2013b) obtain recursions for the expected site-frequency spectrum 
associated with Lambda-coalescents. In the present work we address the issue of detect- 
ing multiple merger coalescents from exponential population growth, using methods based 
on the SFS, by estimating statistical power for both point and interval hypotheses. As an 
alternative approach, we also consider a simple implementation of approximate Bayesian 
computation (ABC; Rubin, 1984; Tavare et al., 1997; Pritchard et al., 1999; Cucala 
and Marin, 2013; Baragatti and Pudlo, 2014). 
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Figure 1: Matching ip{'' (2) for the different coalescent processes II G {E, B,A} for (3 (b), a 
(a), and 7 (g) for algebraic growth (see Supporting Information) for comparison, and number 
of leaves n as shown. Expected values were computed exactly. 
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Theory and Methods 

Basic properties of the site-frequency spectrum 

Consider a sample of n DNA sequences taken at a given genetic locus and assume that 
we can distinguish between derived (new mutations) and ancestral states. For n G N let 
[n] := {1, . . . , n}. We denote by Q n * the total number of sites at which the mutant base 
appears i e [n — 1] times. Then, 

An) ._ ( An) An) \ 

£ - — I SI > • • • > Sn-1 J 

is referred to as the unfolded site-frequency spectrum based on the n DNA sequences. If 
mutant and wild-type cannot be distinguished, one often considers the folded spectrum 
r/™) := (j)^, . . . , ^[^/2jl ' where ancestral and derived states are not distinguished, and 
hence 

A n ) _i_ A n ) 

i-ul ' l<i<[n/2\, 

(Fu, 1995). In this study, we will mostly be concerned with the unfolded site-frequency 

spectrum. Define C| := s : | n ' ) /ls :( - n ^l where |^ n ^| := ^ H l-^i-i denotes the total number 

of segregating sites. Thus, = (([ n \ ■ ■ ■ , Cn^i) ^ s 'normalized' unfolded SFS, with 
the convention that C = 0 in the trivial case of complete absence of segregating sites 

(ie (n) i = o). 

In order to compute expected values, variances and covariances of the SFS, an explicit 
underlying probabilistic model is needed. In the following we assume that the genealogy of a 
sample can be described by a coalescent process, more precisely by either (a timechange of) 
the Kingman coalescent or a multiple-merger coalescent. In addition, the infinitely-many- 
sites mutation model (Watterson, 1975) is assumed, and mutations are modeled by a 
Poisson-process on the coalescent branches with rate 9/2. 

Closed-form expressions for the expected values and (co)variances of have been de- 
termined in Fu (1995) when associated with the Kingman coalescent. One can represent the 
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expected values of £^ in a unified way using the results of Griffiths and Tavare (1998), 
Kaj and Krone (2003) and Birkner et al. (2013b), that allow to treat the expected values 
(and covariances) of the SFS for all coalescent models in questions. 

Let n n = (I\-t,t > 0) be a (partition- valued exchangeable) coalescent process started 
from n leaves (partition blocks) corresponding to the random genealogy of a sample of size 
n. By discussing 'leaves' rather than DNA sequences we are emphasizing our viewpoint 
of the genealogy as a random graph, where the leaves are a particular kind of vertices. 
Our emphasis is on the topology of the genealogy, rather than the associated site-frequency 
spectrum. 

If the initial number of leaves is not specified, we simply speak of II. One may think of 
IT as the Kingman coalescent, but the point is that the following result will stay true also 
for externally time-changed Kingman coalescents as well as asynchronous multiple merger 
coalescents (a.k.a. 'Lambda'-coalescents in the mathematical literature), and even externally 
time-changed multiple merger coalescents. 

Given n and a coalescent model II, let (F^'joo be the block counting process of the 



underlying coalescent IT 1 started from n lineages, i.e. Y t gives the number of ancestral 
lines (blocks) present/active at (backwards) time t. For 2 < k < n, let be the random 
amount of time that spends in state k. Given a coalescent IP started from n (unlabelled) 
lineages, denote by p^' u [k, i) the probability that, conditional on y w taking value k, a given 



one of the k blocks subtends exactly i G [n—1] leaves. A general representation of E n 
is then 
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One can interpret the quantity y3- n ^ n as the probability that a mutation, under the infinitely 
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many sites assumption and the coalescent model II, with known ancestral types, appears 
i times in a sample of size n. Importantly, <p\ n is not a function of the mutation rate, 



unlike E n 
value E n 



(n) 



C 



(n) 



first-order approximation of the expected 



. One can also view <p\ 
of the normalised SFS. 
As examples for II we will consider the classical Kingman coalescent (K), exponential 
growth (E), and the Beta(2 — a, a) multiple-merger coalescent (B). These are recalled in 
the Supporting Information. The quantity <p\ n ^' U can be compared with in a minimum- 

(n) B 

distance statistic (see below). Simulations suggest that ip\ n> ' is a decent approximation of 



E 



B 



(■ n ^ when a is not too close to 1, and n not too small (Birkner et al, 2013b). Similar 
conclusions hold in the case of exponential growth (Figure S2). 

Recursive formulae for the covariances of the SFS can also be established; see e.g. (23) 
in Fu (1995) for the Kingman case or Birkner et al. (2013b) for the Lambda-case. For 
general models of changes in population size cf. section 'The SFS under variable population 
size' in the Supporting Information. 



(n) 



obtained 



Comparing the observed (-^ (instead of to its expected value E n 
under a particular coalescent model II - enables one to do inference without having to jointly 
estimate the mutation rate 9 using e.g. a minimum-distance statistic. Unfortunately, there 



(n) 



simple function of the coalescent 



seems to be no explicit way of representing E n 
parameters and sample size n. One may, instead, compare Q n ^ to (2) (Birkner et al, 
2013b). 



Timescales, segregating sites and mutation rates 

The choice of a coalescent model (resp. demographic history) II and its underlying parameters 
strongly affects classical estimates for the coalescent mutation rate 9/2 (i.e. the Poisson rate 
at which mutations appear on coalescent branches). Assume w.l.o.g. for all multiple merger 
coalescents in question that the underlying coalescent measure A is always a probability 
measure: This normalisation fixes the coalescent time unit as the expected time to the 
most recent common ancestor of two individuals sampled uniformly from the population. In 
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particular, 9 can then be interpreted as the expected number of observed pairwise differences 
in a sample of size two. 

Given an observed number of segregating sites S in a sample of size n, a common estimate 
9 U of the scaled mutation rate 9 is the Watterson estimate 

' E n [£(")]' 1 ^ 

where E n \B^ n >] is the expected total tree length of the underlying coalescent model IT. One 
can of course also estimate 9 as a linear combination of the site-frequency spectrum (cf. 
Achaz, 2009) in the case of the Kingman coalescent. Using the recursions for E n £^ 
obtained by Birkner et al. (2013b), one can now also estimate 9 analogously in case of a 
Lambda-coalescent . 

The real-time embedding of genealogies can vary drastically between different model 
classes. Mathematically, coalescent processes can be obtained as the limits of genealogies 
in Cannings models (Cannings, 1974, 1975) (resp. similar models) under a suitable time- 
change. The real-time embedding is the expression of coalescent times on the time scale of the 
underlying Cannings models. This important aspect of coalescent models is subtly hidden 
in the actual resulting limiting models. For example, given a Cannings population model of 
fixed size N, let cn be the probability that two gene copies, drawn uniformly at random and 
without replacement from a population of size N, derive from a common parental gene copy in 
the previous generation (see (S10) in the Supporting Information for an explicit formula and 
further details). While for the usual Wright-Fisher haploid model c N = 1/N, in a population 
model studied by Schweinsberg (2003), which leads to the Beta(2 — a, a)-coalescent, cn 
is proportional to l/iV a_1 , for 1 < a < 2. By a limit theorem for Cannings models of 
Mohle and Sagitov (2001), one coalescent time unit corresponds to l/c N generations in 
the original model with population size N. In this framework, the expected total tree length 
E B [£?( n )] (measured in coalescent time units) decreases as a function of a G (1,2], while the 
corresponding quantity (measured in generations) E [B^]/c^ increases (Figure S3). 

Accordingly, the mutation rate /i at the locus under consideration per individual per 
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generation must thus be scaled with l/c N (as noted e.g. in Eldon and Wakeley (2006)), 
and the relation between /x, the coalescent mutation rate 9/2 and cat is then given by the 
(approximate) identity 

" = A - (4) 
2 c N K ' 

This allows one to obtain an approximate real-time calibration of the coalescent time unit 

1/ctv, given external knowledge of the per-generation mutation rate /i, and an estimate for 

9, e.g. based on (3). 

The time-scaling applied to a classical Wright-Fisher model with fluctuating population 
size (as in Kaj and Krone (2003)) in order to obtain a (time-changed) Kingman coalescent 
is recalled in the Supporting Information, see in particular Equation (Sll). Again, the 
estimate (3) of 9 depends on the exponential growth parameter (3. 

Our aim is to construct a test resp. a decision rule to distinguish between the two model 
classes E and B (which intersect exactly in K). Hence, our methods can also be used to dis- 
tinguish growth from stationary models and multiple merger coalescents from the Kingman 
coalescent model, thus complementing existing literature (cf. eg. Ramirez-Soriano et al, 
2008; RAMOS- Onsins and ROZAS, 2002). Our investigation can be applied to more general 
Lambda-coalescents or other growth models. In the Supporting Information, we additionally 
consider the case of algebraic (power law) population growth, which may be applicable when 
the geometry of a habitat (say, a coastline) restricts the growth of a population. 

Approximate likelihood ratio tests for the SFS 

In order to distinguish, say, E from B based on an observed site-frequency spectrum, a natural 
approach is to construct a (unnested) likelihood-ratio hypothesis test. Suppose our null- 
hypothesis H 0 is presence of recent exponential population growth E with some parameter 
ft G [0, oo), and we wish to test it against the alternative Hi hypothesis of a multiple merger 
coalescent, say, the Beta(2 — a, a)-coalescent (Schweinsberg, 2003) for some a 6 [1,2]. 
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To clarify our notation, define the hypotheses we will work with; 

O 0 := {(exponential growth (/?), 9) : /3 G [0, oo), 9 G (0, oo)} 



and 



©i := {(Beta(2 - a, a)-coalescent, 9) : a G [1, 2], 9 G (0, oo)} 



where we interpret (3 = 0 and a = 2 as corresponding to the Kingman coalescent. Later, 
we will fix the expected total number of segregating sites, which will make 9 a function of 0 
(resp. a). As shorthand, we will use <dp := [0, oo) and Q a = [1,2] where the intervals refer 
to 0 resp. a. Recall that £^ is the observed site frequency spectrum for a sample of size n 
and, given mutation rate 9/2 > 0, let (p^' n (2) be the normalized expected site frequency 
spectrum. 

Let H 0 := {$ E 6 0 } denote the null-hypothesis and H\ = G 61} denote the alter- 
native. Let be the likelihood function of the observed frequency spectrum under 
model Then, one can define the likelihood-ratio function 

(n) . = su P {L(^)),^Ge 0 } 
" ' sup{L(?U (n) ),tf G 0i}' 

One would reject the null-hypothesis H 0 if g is sufficiently small. More precisely, given a 
significance level a G (0, 1) (say, a = 0.05), one needs to determine a constant g* such that 

su P p4f?(? (n) ) < Q*\ <a. (6) 
t)ee 0 L J 

The corresponding power function G is then given by 

G($) = P4f?(i W ) < Q*h ^8i- (7) 

Exact likelihoods can be computed recursively or via simulation (see e.g. Eldon and Wake- 
ley (2006) and Sargsyan and Wakeley (2008)) from the representation (8) where 
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is the random length of branches subtending i E [n — 1] leaves (DNA sequences) with 

B (n) ._ B (n) + . . . + being the tQtal length of the tree ( recall E^ (n) ] = (6/2)E d [B ( : n) }) 

with expectation being taken relative to coalescent process II with parameter (to ease 
notation, write £?; = B: n '), 

/ n\ ki+-+k n -i n-l 

pnd n) = h, i= i, . . . , n - 1} = r- j e* [ n e ^ ie/2 |v] • ( 8 ) 

For medium to large n, however, it is more practical to consider the approximate likeli- 
hood function 

i=i *■ 

(where s = ^^[B^]) pretending that the classes are approximately independent and 
Poisson-distributed (which is encouraged by the fact that the off-diagonal entries of the 
covariance matrix of £ (n * ) are small compared to the diagonal terms, see Birkner et al. 
(2013b)). The corresponding approximate likelihood-ratio will be denoted by g, and can 
then be used to determine quantiles associated with significance level a as in (6), and power 
function G. 

One often observes ^f 1 " 1 = 0 for most i greater than some (small) number m in observed 
data, in particular for large n. It thus seems natural to consider (approximate) likelihood 
functions for "lumped spectra" (e.g. collapsing all entries in classes to the right of some 
number m into one class m + ). 

Another natural type of lumping may be to collect together classes so that (pf 1 ^ > x 
for some x G (0, 1/2]. This may not always be feasible, though, if the individual (p[ n ^ quickly 
become quite small, and we will refrain from going into a more detailed theoretical discussion 
of optimal lumpings. 

Instead of (approximate) likelihoods, one can also consider rejection rules based on min- 
imal distance statistics: 

M _ = infM^)(^),e (rt) )^ee 0 } 

6 ' inf{c%M(tf),e {n) ),tf G 6i}' 
14 
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for some suitable distance measure d (e.g. the £ p distance with p = 2) with corresponding 
power function G^ d \ If entries in the SFS become sparse and far apart, lumping the right tail 
should increase its relative contribution to the £2 distance, and might thus be more adequate 
for our purposes. In fact, we will observe such an effect in our subsequent analysis. 

Additional information about the parameters in each model, for example reducing the 
respective parameter ranges to one-point hypotheses of type H 0 = {$ = /3*} vs. Hi = {d = 
a*} } is expected to lead to substantially more powerful tests, in particular if both hypotheses 
are well-separated from the Kingman coalescent. 

Approximate Bayes factors and model selection 

Using the previous notation, an analogous Bayesian approach is to choose a model class 
based on a Bayes factor 

Bayes . = /eo^C (w) )<M<SQ 

given a pair of priors 7r 0 , 7Ti on O 0 , 61 (given the same prior probability of each model 
class, this is also the odds-ratio). We use the less informative normalized site frequency 
spectrum (nSFS) of an n sample, denoted by since it is robust to changes in mutation 
rates. Bayes factors based on (lumped) distances d and/or the folded nSFS may also be 
considered. In line with classical Bayes factor philosophy (cf. e.g. KASS and Raftery 
(1995)), one interprets an observed value of q b 3> 1 as evidence in favor of 6 0 over 9i. As 
exact likelihoods L($,(^) are often impractical, we employ approximate Bayesian methods 
(see e.g. Beaumont (2010)). 

For the ABC analyses, we again focus on exponential growth (E) versus Beta coalescents 
(B) and denote the corresponding Bayes factor by g E ^ B . Recall that for fixed n, both model 
classes can be parametrised by two parameters, mutation rate 9 G (0, 00) and exponential 
growth rate /3 G [0, 00) resp. the Beta coalescent parameter a G [1,2]. To choose prior 
distributions on these two-dimensional parameter sets Go and 0i we first record/assume the 
number s of segregating sites in the data. Then, we set marginal prior distributions for the 
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growth parameter f3 resp. the Beta coalescent parameter a. Finally, for each {3 resp. a, the 
mutation rate is determined as the Watterson estimate 6 n (3), which of course depends on 
f3 resp. a through the total branch length E n [B^] . We chose the mutation rate so that 
the expected number of mutations under the chosen coalescent model equals the Watterson 
estimate based on the (assumed) number of observed mutations. 

For convenience, we employ a simple rejection-based ABC scheme to approximate the 
Bayes factor for the model (class) comparison given an observed nSFS (resp. folded and/or 
lumped versions, which can be treated analogously). First we simulate n reps independent 
samples of the nSFS under each model class and record the distance to the observed nSFS 
in question. We then fix a tolerance level x G (0, 1) and count the number of simulations tie 
from the growth model resp. n% from the Beta coalescent model class that are among the x% 
best fits with respect to the £ 2 distance to the observed nSFS (the 'accepted' simulations). 
Here, we use an additional scaling by dividing each class (lumped classes) in the nSFS by 
the median (if non-zero) within this class observed in all simulations as implemented in the 
R package abc (Csillery et al, 2012). The Bayes factor can then be approximated by 

n B 

To assess how well our ABC approach allows us to distinguish the model classes, we use 
two approaches from the R package abc. Both are based on leave-one-out cross-validation. 
More precisely, we pick n cv simulations at random from each model class, treat them as 
the observed value of the nSFS and then run the ABC approach with the same parameters 
and simulations as above. For each cross-validation i M G {1, . . . , n cv }, with M denoting the 
model class E resp. B, we record the counts of accepted simulations tie (^m) and wb^m) from 
each model class. As measures for the distinction ability of this approach, we record for 
each model class the (estimated) mean posterior probabilities it given the observed nSFS, 
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borrowing notation from Stoehr et al. (2014), 



E £ 



7T(E|C 



E 

i B =i 



™e(*b) + ™b(«B, 



and 



E 



(n)> 



n,. 



E 

117=1 



?We) + ?We) 



The mean misclassification probabilities are estimated as 



E £ 



7T 



/ /B > iic 



Ei 

i n =l 



{n E (i B )>n B (i B )} 



and 



E £ 



^ E/B <i|C 



Ei 

» E =i 



{n E (i E )<ng(i E )}- 



To ease notation we will omit n in the formulae. 

In practice, we need to efficiently generate samples of the nSFS under the different models 
which can be achieved by backward-in-time coalescent simulations. For the exponential 
growth models (E), we use Hudson's ms (Hudson (2002)) as implemented in the R (R 
Core Team, 2012) package phyclust (Chen (2011)). For the Beta-coalescents (B), we 
use custom R and C scripts to generate samples of the nSFS. To conduct the actual ABC 
analysis including cross-validation techniques, we employed the R package abc (Csillery 
et al. (2012)). 



Results 
Power estimates 

To assess the sensitivity of our approximate likelihood ratio test associated with the likelihood 
ratio function (5), we consider its power G from (7) as a function of a with H Q = Op 
and Hi = Q a . As shown in Figure 2, high power (at least 60%) to distinguish Beta(2 — 
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a, a)-coalescent from exponential growth can be obtained, in particular in the presence of 
sufficiently many segregating sites (Figure 2B). Similar conclusions hold for a smaller sample 
size (n = 100; Figures S12, Sll). The number 300 of segregating sites is nearly the total 
number of polymorphisms (298) observed by Carr and Marshall (2008) who scanned 
whole mitochondrial genomes (15, 655 bp) of the highly fecund Atlantic cod (Gadus morhua). 
For comparison, Figure 3 shows power estimates for observed parameters ((n, s) = (30, 300)) 
resembling the data considered by Carr and Marshall (2008). Reversing the hypotheses 
shows similarly promising power estimation results (Figure 4) for n = 200. We do not have 
a clear explanation for why the power is not monotone as a function of (3, in particular for 
smaller Type I error. 

When sample size is large, one will typically encounter many zeroes in a given observed 
site-frequency spectrum, so that we will lump the right tail of the spectrum at some threshold 
m + . Our power estimates suggest that keeping at least the first five classes intact, and 
collecting the rest into one other class, has little effect on the power of the test (results not 
shown). Keeping only the singleton class intact, and collecting all the rest into one 

class, however, significantly diminishes power (results not shown). 

C (cf. Kernighan and Ritchie, 1988) code written for estimating the power, where use 
was made of the GNU Scientific Library (Galassi et al, 2013) and the GMP (Granlund 
and the GMP development team, 2012) and MPFR (FOUSSE et al, 2007) multiple 
precision libraries, applying Romberg Integration (Bauer, 1961), is available upon request. 
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Figure 2: Estimate of power as a function of a; when exponential growth is the null hypothesis, 
and the test statistic is sup{£(tf, £ (ri) ), E 0 O } - sup{£(tf, £ (n) ), d E 6i} (5), with £($,£ {n) ) 
the log of the Poisson likelihood function (9) (no lumping) with n = 200 and segregating sites 
(s) as shown. The symbols denote the size of the test as shown in the legend. The interval 
hypotheses are 0^ = {(3 : (3 E {0, 10, . . . , 1000}} and B a = {a : a E {1, 1.05, . . . , 2}}. 
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Figure 3: Estimate of power as a function of a; when exponential growth is the null hypothesis, 
and the test statistic is sup{£(tf, G 0 O } - sup{£(tf, £ (n) ), d e 6i} (5), with £($,£ {n) ) 

the log of the Poisson likelihood function (9) (no lumping). The different symbols denote 
the size of the test as shown in the legend. The interval hypotheses are = {/3 : (5 e 
{0, 10, ... , 1000}} and 9 Q = {a : a E {1, 1.05, . . . , 2}}. 
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Figure 4: Estimate of power as a function of log 10 (3 when the Beta-coalescent is the null 
hypothesis, and the test statistic is sup{£($, £ (n) ), $ e 0 O } - sup{£($, £(")),$ e 0i} (5), 
with £(^,^ n )) the log of the Poisson likelihood function (9) (no lumping); with n = 200 
and number of segregating sites s as shown. The test sizes are 0.1 (solid line), 0.05 (dashed 
line), 0.01 (dotted line). The interval hypotheses are Bp = {/3 : 0 E {0, 10, . . . , 1000}} 
and 6 Q = {a : a G {1, 1.05, . . . , 2}}. Values at log 1Q /3 = 0 correspond to the Kingman 
coalescent. 
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Given additional information about a or f3 could lead one to test the point hypotheses 
Hq = {(3 = (3*} against Hi = {a = a*} for some fixed f3* > 0 and a* 6 [1, 2], or vice versa. 
An example of power estimates for point hypotheses, as a function of sample size n, is shown 
in Figure S5. High power can be obtained even for relatively small number of segregating 
sites (s = 10). Even excluding the singletons (cf. eg. Achaz, 2008) yields high power for 
some parameter values (Figures S6 and S7). 

Considerations of statistical power for point hypotheses naturally lead to the following 
question: how does the distance between expected site-frequency spectra behave as a function 
of relevant coalescent parameters? Figure 5 is an effort to understand the relation between 
the expected SFS for the two models (E and B) by graphing the distance between expected 
normalised spectra ^ n " > '^ and <pf^ as a function of a and (3. The normalised spectrum is of 
course invariant with respect to mutation rate 6. Figure 5 shows a pattern one would expect 
from predictions of relative lengths of external branches, and hence singleton polymorphisms 
- increasing as (3 increases or a decreases. Similar conclusions can be reached from Figure S4, 
which shows the distances for n = 1000. In particular, one observes that the minimum of 
the curves for eg. (3 = 1000 in Figure S4 shifts as sample size n increases. The smallest £ 2 
distance in Figure 5 is ~ 0.02 - excluding 0 due to comparison of Kingman coalescent with 
itself ((3 = 0 and a = 2). Naturally one would also want to compare the quantiles of Q n \ or 
Q , associated with different processes, since two very different distributions can still have 
the same mean. 

Figures 5 and S4 indicate the presence of a region, essentially a curve in the two- 
dimensional (a, (3) parameter space, along which the lowest £ 2 distance is reached. 
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Figure 5: The £ 2 -distance \^pi — f\ n j J of the expected normalized spectra 

iff 1 n (2) as a function of a and (3 for n = 200. Expected values were computed exactly. 
The gridpoints are a G {1, 1.05, . . . , 2}, 0 e {0, 10, 20, ... , 1000}. The smallest distance for 
the gridpoints chosen is ~ 0.02 - excluding 0 due to comparison of Kingman coalescent with 
itself (p = 0 and a = 2). 
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Mean misclassification probabilities & posterior probabilities for the 
ABC approach 

We analyse how well an ABC approach using the nSFS resp. the folded nSFS and their 
lumped variants as summary statistic can distinguish between exponential growth and the 
Beta(2 — a, a)-coalescent. The distinction ability of the ABC model comparison is assessed 
as described in the Methods section. We specify the following parameters 

• The Beta(2 — a, a)-coalescent parameter a has the uniform distribution on [1, 2] as 
prior distribution. 

• The growth rate /3 has the uniform distribution on the set {0, 10, 20, ... , /3 max } of 
subsequent multiples of 10 as prior distribution. 

• We simulate the nSFS n reps = 2 x 10 5 times for n = 200 in each model class. 

See Tables 1 and S1-S3 for the estimates of posterior probabilities and misclassification prob- 
abilities (some with one replication). 



Table 1: Approximations of the mean posterior probabilities and misclassification proba- 
bilities for the ABC model comparison for tolerance x = 0.01, assumed number s = 60 of 
observed mutations and using either the nSFS or the nfSFS as summary statistics. n cv de- 
notes the number of cross-validations 'lumped' denotes which mutation classes are lumped 
into one class. The maximal growth rate used in all model comparisons is /3 max = 1000. 



E B [tt(E|C)] E e [tt(B|C) 



fold lump 



)!,. 



E B |7r(e E / B > 1|C)| |vr(^/ B < 1|C) 



-E/B 



no 
no 
no 

yes 
yes 
yes 



10+ 
50+ 
100+ 
10+ 
50+ 
no 



24000 
12000 
1200 
24000 
12000 
12000 



0.301 
0.321 
0.332 
0.319 
0.34 
0.343 



0.246 
0.291 
0.293 
0.253 
0.287 
0.291 



0.258 
0.262 
0.282 
0.281 
0.284 
0.287 



0.128 
0.125 
0.17 
0.125 
0.155 
0.162 



Appropriate lumping seems to decrease the error probabilities. For s = 60 observed 
mutations, strong lumping is decreasing the error probabilities for most parameter choices, 
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whereas for s = 300 (Table SI) moderate lumping seems to decrease them the most. Not 
surprisingly growth rates closer to zero are harder to distinguish from the Beta(2 — a, ai)~ 
coalescent models than higher growth rates (see Tables SI and S3). Additionally, a lower 
count of observed mutations leads to higher error probabilities, as does using the folded nSFS 
as summary statistics. 



Discussion 

Distinguishing between multiple merger coalescents and population growth is an important 
task, in particular since patterns of genetic variation produced by the two demographic 
effects, and summarized in the site-frequency spectrum, is expected to be similar. 

The unit of time of different coalescent processes can vary considerably. As discussed 
above, predictions about genetic variation can be compared in at least three different ways. 
One is with time computed in coalescent units, and employing the scaled mutation rate 9. 
Another way is comparing normalised spectra Q n \ whose comparison should be independent 
of timescaling. The third option is to rescale time in generations, which is of course only 
feasible when one can compute the coalescence probability Cjv- Once again, in an ideal 
haploid Wright-Fisher population, cn = 1/N. In an ideal Schweinsberg population, cn is 
proportional to iV 1_a , 1 < a < 2 (Schweinsberg, 2003). The difference between 1/N and 
N^ a can clearly be substantial. Consequently, it becomes quite hard to compare estimates 
of 9 between different processes, since 9/2 « h/cn for any coalescent process (where ji is 
the per-generation mutation rate). Thus, our recommendation would be to compare scaled 
spectra C (n) - 

A key result from our power estimates is that, even for moderate sample size, and based 
on interval hypotheses, the two processes can be distinguished for significant parts of the 
parameter space of a and /3. By using recent results (Birkner et al, 2013b) on computing 



E 1 



An) 
Si 



associated with Lambda-coalescents, and recursions for computing E E Q n> (Sup 



u(n) 



porting Information), allowing us to work within an approximate likelihood framework, we 
obtain very promising results. Thus, given sample size (n) and observed number of segre- 
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gating sites (s), our recommendation would be to always estimate power based on interval 
hypotheses. 

The construction of a formal statistical test to distinguish between a multiple merger 
coalescent process and population growth is complicated in both the minimum-distance and 
the likelihood framework. A suitable distance-metric would be (p > 1) 



z ( n ),n ._ 



0 



p\ 1/p 



or a lumped version thereof, where Var 11 Q n ' denotes the variance of Q associated with 
coalescent process IT. However, we neither have a closed-form expression for the expected 
value nor variance of as a function of sample size or coalescence parameters, let alone 
having any knowledge about the distribution of Zp n ^' n . In the likelihood framework, the 
hypotheses are not nested, and thus it's not clear if convergence to a ^-distribution holds. 
One may instead apply an ABC approach. 

Our analysis for sample size n = 200 suggests that an ABC method based on the £ 2 
distance of simulated values to an observed nSFS is able to distinguish between Beta(2— a, a)- 
coalescents or n-coalescents with growth with reasonably low error rates. This holds true 
at least for a high enough number of observed mutations and models different enough from 
Kingman's n-coalescent (i.e., /3 not close to 0, a not close to 2). The nSFS is used because it 
is more robust to changes in mutation rates than the SFS. We exploit this by estimating the 
mutation rate from the (assumed) number of observed mutations in the sample instead of 
drawing it independently from a prior distribution. This is done to reduce the dimensionality 
of the inference problem. 

It is intriguing that using the complete nSFS as summary statistics in the ABC approach 
yields higher errors than using intermediate resp. strong lumpings of the nSFS. A possible 
explanation for the positive effect of lumping lies in the relationship between the branch 
lengths of the coalescent model, the mutation rate and the SFS. Consider the approximate 
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likelihood function (9). Assume that the distribution of the SFS is approximately composed 



of independent Poisson distributions with parameter |E n 



B 



(n) 



for i G [n — 1]. For a 



Poisson-distributed random variable X with parameter k, we have 



E(X) 



thus 



showing that smaller Poisson parameters yield a higher amount of variation relative to their 
expected value. Thus, classes in the SFS with small underlying branch lengths (which tend 
to be in the right tail of the SFS) and/or a low mutation rate show relatively more variation 
compared to their contribution to the total number of mutations than those with longer 
branches or if the mutation rate is higher. Lumping such classes together, under (9), yields 
again a Poisson-distributed lumped class, but with Poisson parameter being the sum of 
parameters from the classes lumped together. Thus, the variation within this class relative 
to its contribution to the total number of mutations is reduced by lumping. If different 
coalescent models show different mean behaviour of (lumped) classes, lumping reduces noise 
and thus increases the chance to correctly identify the underlying model. Naturally, this 
effect is weakened by higher mutation rates and/or higher sample size n (e.g., consider the 
limit results for the SFS in Berestycki et at (2013) and Kersting and Stanciu (2013)). 



However, the inequalities W 



An) 


> E B 


An) 
Si 


, and E A 


An) 
^3 


< E B 


An) 
^3 



for i 7^ j and 



two different coalescent models A and B, would, if extreme lumping is applied, eg. collecting 
together in one group all mutations other than singletons, reduce the signal of the different 
models. Extreme lumping would thus decrease the chance to correctly identify the model. 
This heuristic could explain the pattern within the error probabilities. To check whether this 
heuristic actually holds true, one would need to have better knowledge of the distribution of 
the SFS resp. nSFS. 

Thus, using an appropriate weighing of the variables in the nSFS resp. SFS should 
improve the power to distinguish between model classes. It would also be a worthwile future 
study to see whether a one-dimensional summary of the SFS similar to Tajima's D or Fay 
and Wu's H, as described in Achaz (2009), could yield similar or even higher power to 
distinguish between the model classes than the complete (possibly reweighted) nSFS. In 
our ABC computations a simple rejection-based approach was applied. More sophisticated 
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techniques are available (see Beaumont (2010) for an overview) that may improve the 
prediction accuracy. 

Marine organisms with external fertilization and Type III survivorship curves, such as 
Pacific oysters or Atlantic cod, and in which each gender produces a large number of gametes 
(May, 1967; Strathmann, 1987), are prime candidates for natural populations exhibiting 
high fecundity and skewed offspring distributions (Beckenbach, 1994; BOOM et ai, 1994; 
Arnason, 2004). Previous analysis of mtDNA of Atlantic cod (cf. eg. Birkner et at, 
2013b) and Pacific oysters (Sargsyan and Wakeley, 2008) supports the hypothesis that 
multiple merger coalescents may be more appropriate than the Kingman coalescent as base- 
line models for high fecundity organisms with skewed offspring distributions. Our ABC 
analysis of the Atlantic cod data of Arnason (2004) (see Supporting Information) indicates 
exponential growth is slightly favored over the Beta(2 — a, a)-coalescent. Arnason (2004) 
does exclude population growth in his analysis. Our results indicate that the SFS informa- 
tion one obtains from the Arnason (2004) data may not have enough polymorphic sites to 
distinguish between exponential population growth and the Beta(2 — a, a)-coalescent. 

The problem of distinguishing between multiple merger coalescents and population growth 
was our motivation. We proceeded by analysing two natural models with clear biological 
interpretation, the Beta(2 — a, a)-coalescent and exponential growth using the site-frequency 
spectrum. The power estimation results are very promising: for even moderate sample size 
we can distinguish between the two models for large parts of the associated parameter spaces 
with high power. Thus, we recommend, for a given sample size (n) and number of segre- 
gating sites (s) to estimate power based on interval hypotheses. Approximate Bayesian 
computation also yields convincing results, at least for a high level of polymorphism. The 
SFS, in general, has enough information to distinguish between exponential growth and mul- 
tiple merger coalescent processes. Comparing our single-locus methods and results to those 
obtained for multiple loci remains an important future exercise. 
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Figure SI: Relative expected length of external branches <^?( n ) (2) for II 6 {B,E} as a 
function of the coalescent parameters a resp. /3; '2.0' and ' — 1' denote the Kingman coalescent; 
number of leaves n as shown. Expected values were computed exactly. 
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The SFS for multiple merger coalescents 

Some basic theory of multiple merger coalescents, which are also known as LamMa-coalescents 
in the mathematics literature will now be briefly reviewed. The theory associated with si- 
multaneous multiple mergers, or so-called S-coalescents (Schweinsberg, 2000; Mohle and 
Sagitov, 2001), could be treated by similar methods, but will not be discussed at this point. 

The discussion of the relevance of Lambda-coalescents in population genetics started 
with Eldon and Wakeley (2006). Recall that a Lambda-coalescent, formally introduced 
by Pitman (1999); Sagitov (1999), and Donnelly and Kurtz (1999), is a partition- 
valued exchangeable coalescent process determined by a finite measure A on [0,1]. When 
A is associated with the beta-distribution with parameters 2 — a and a for 1 < a < 2 
(Schweinsberg, 2003), any particular subset of /c e {2, ...,6} blocks merges into one 
(ie. a particular set of k labelled ancestral lineages coalesce) at rate A^ = B(k — a,n — 
k + a)/B{2 — a, a), where -£>(•,•) is the beta function. In contrast, A^ = l(k=2) f° r the 
Kingman coalescent (A(dx) = S 0 (dx)). The Beta(2 — a, a)-coalescent introduces a coalescent 
parameter a, which can be estimated from genetic data (Eldon, 2011; Birkner et al, 
2013b; Birkner and Blath, 2008; Steinrucken et al, 2013). 

As before, for any Lambda-coalescent LT A ) and i G [n — 1], the expected frequency 
spectrum E n(A) [^ (n) ] is given by (1). Denote by (Y} n ^ ,n ,t > 0^ the block-counting process, 
ie. Y^' U gives the number of ancestral lineages active at time t; with P (Yq U ^ U — n) = 1. 
In contrast to the Kingman case, Y"( n )> n(A) might not hit all possible states between n and 
2 when associated with a multiple merger coalescent. An additional implicit conditioning 
must therefore be employed. Indeed, the term p^' u [k, i] = p( n )' n(A) [k, i] in the corresponding 
version of (1) denotes the probability that, in a Lambda-coalescent started from n unlabelled 
lineages, conditional on hitting a state with k lineages, a given one of the k blocks subtends 
exactly i e [n — 1] leaves. 

While the recursive approach underlying (1) can be generalized to compute variances and 
covariances (Fu, 1995; Birkner et al, 2013b), and in fact to any mixed moment of any order 
(however, at the cost of rapidly increasing complexity), the explicit distribution of the SFS is 
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in general unknown. Berestycki et al. (2013) provide the large n almost sure asymptotic 
behaviour of the SFS for Beta-coalescents (and, more generally, all coalescents with a suitable 
polynomial singularity at 0). In Kersting and Stancitj (2013), the convergence of the joint 
law of the internal branch lengths, as n grow large, to a multivariate Gaussian distribution 
is established in the case of the Kingman coalescent, but the convergence appears to be 
rather slow. The corresponding result for (subclasses of) Lambda-coalescents is desirable, 
but currently still elusive. Again the rate of convergence into the scaling limit is expected 
to be rather slow, and the result might therefore be more of theoretical value. 

The SFS under variable population size 

The effect of fluctuations in population size on the underlying ancestry has been investigated 
in various articles, see in particular Griffiths and Tavare (1998), who derive an analog 
of (1), and Kaj and Krone (2003) who link the Wright-Fisher approximation (with fluctu- 
ating population size) with the limiting genealogy. An important point is that (moderate) 
exogenously determined fluctuations in population size do not affect the binary coalescence 
structure of the genealogy, but enter only in the distribution of the relative length of the 
internal branches, and hence the genealogy can be described by a time-change of a Kingman 
coalescent. In contrast, large fluctuations, in particular very severe bottlenecks, can lead to 
very different coalescent models, see (Birkner et al, 2009). 

Recursions for the expected values and covariances of the site-frequency spectrum asso- 
ciated with moderate fluctuations in population size will now be obtained. The recursion 
can be obtained following Polanski et al. (2003). We consider a time-inhomogeneous 
Kingman coalescent, started in n lineages, where each pair of lines present at time t > 0 
merges at a rate u(t) (instead of rate 1) (cf. Griffiths and Tavare, 1998). Given the 
exogeneously determined time-change v = (u(t),t > 0), the expected frequency spectrum 
E[£ l ^' I/ ],i G [n — 1], is again of the form (1), and the time-change v enters only in the dis- 
tribution of the T^ 1 ' = Tp > ' v , 2 < k < n, that is, the distribution of the lengths of the time 
intervals of the block-counting process Y^' v during which there are exactly k lineages. 
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To evaluate E^"^] one needs information about E[T^'"]. Define 

SM*:= 3j»)-" + T^\ v + ■ ■ ■ + Tf' v , j=n,...,2 (SI) 

to be the time at which the block counting process jumps from j to j — 1 lineages 

(with the convention S^+f := 0). Further, abbreviate, for t > 0 and j 6 2, . . . , n, 

F(t):= [ u{u)du and a 5 := [ e~^ F(s) ds, (S2) 



assuming that the first integral in (S2) is finite. It is possible to compute the marginal density 
of S& ),v using the well-known fact that the density of a convolution of exponentials with dif- 
ferent rates can be written as a linear combination of exponential densities (Polanski et ai, 
2003). Indeed, this leads to the following recursive characterization for the expected values 
of the S { jf >,u and T^' u (see section 'Deriving the expected SFS under variable population 
size' below for a proof): For 2 < m < n and m < j < n let 

m ' II © - a ( j - 1) rn ' 1 } 

(put cS'"' = 0 for j < m). Then we have the representation 

n 

eK'I^E^-- ( S4 ) 



and 



E [2***] = E - Sgf] = c^a k + £ {cf n) ~ cf:l)a 3 , (S5) 

j=k+i 
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which allows us to compute the expectation of easily from (1): 



E [d n) '1 = ^^2 k -Pn,k(i) ■ ■/, "/- ■■■ 

fc=2 L j=k+l 



ci k) a k + e (4 j) -ek- 



(S6) 



Again we emphasize that the p n> k{i) are not affected by population growth, and can be given 
explicitly (exactly as in the Kingman-case, cf. Fu (1995)): 

(n-i-l\ 

C\ i V fc-2 I 

Pn,k{l) ■- i-{k<n-i+l} / n _n • 

U-iJ 

We now specify the main ingredient dj (depending on F(t),t > 0 and hence v(t),t > 0) 
explicitly for two important special cases: 

a) Exponential growth. In the case of an exponentially growing population with 

growth parameter /3, that is, v(t) = e^, 

a j = ±ew(fr 1 @)E 1 fr 1 ®), (S7) 

where 



oo g— x roc g—tx 



Ex{t):= / dx= dx (S8) 

Jt x Ji x 

is an exponential integral function, cf. e.g. (Abramowitz and Stegun, 1964, 5.1.1). 

Two practical issues must be stated at this point. The coefficients cE'"' (S3) involve 
binomial coefficients which become large for large sample sizes (number of leaves), which can 
lead to numerical errors. Multiple precision packages such as GNU MPC (Enge et al, 2012) 
or MPFR (FOUSSE et ai, 2007) can be employed to compute the cE'"' coefficients exactly. 
In addition, one must evaluate numerically the integral Ei(t) in (S8). The computing time 
of Ei(t) using multiple precision packages quickly increases with sample size. Hence, for 
large sample sizes (n > 200), we applied simulations to estimate expected values associated 
with exponential population growth using the results of Griffiths and Tavare (1998), 
specifically Equation (2.6), which gives a simple way of drawing values of the Sj' P ', the 
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successive coalescence times. 

b) Algebraic ('power law') growth. In the case of algebraic growth of the form 
u(t) = t 1 for some 7 > 0, 



While exponential growth is a natural model for a population described by a supercritical 
branching mechanism, this appears less natural in the power-law case. 

Based on Equation (23) in Fu (1995), it is also possible to compute the variance and the 
covariances of the SFS based on expressions for E^T^T^'% 2 < k,l < n, which in turn 
can be obtained from 



E v [T™' v T l (nhu ] = E v [Sr' v SP' u ] - E u [S^[ u S\ nhu ] - E v [Sf hv S\Y] + E^Sf^TL 



where E[Sm v \S k = Sk] can be computed (it is the expectation under a regular condi- 
tional probability) as in (S4) replacing v by z>(-) := u(- + s^, c m ' n ^ by Cm '■= Cm^ and F by 



Formula (S2) for the (aj) invites a question analogous to Myers et al. (2008), namely 
whether different choices of F can lead to the same sequence of coefficients (a,), which is 
however outside the scope of the present work. Besides such purely theoretical consider- 




(S9) 



noting that, in the above notation, 




and 



F{-) = F(s k + -)-F(s k ). 
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ations, a rather large body of work has also been devoted to practical inference of popu- 
lation growth from genetic data (see e.g. Tajima (1989a), Slatkin and Hudson (1991), 
ROGERS and Harpending (1992), and Ramos-Onsins and ROZAS (2002)). Simulation- 
based work include Ramirez- Soriano et al. (2008), who consider the statistical power of 
several tests under population size increase and decrease, and the impact of recombination. 
Ramos-Onsins and ROZAS (2002) consider the statistical power of statistics based on the 
site-frequency spectrum to distinguish deterministic population growth from the Kingman 
coalescent. 



Comparison of <p[ n ^^ and 



z (n) 



The agreement between (p\ n) ' (2) and E Z\ n ' , where Z- n) = B\ n ' /B^ are the relative 
branch lengths subtending i G [n — 1] leaves (equivalent to DNA sequences), with = 
B^ + • • • + denoting the total tree length, is checked in Figure S2. In Figure S2, the 
difference ((pf 1 ^ — z[ n ^ relative to y^-™^ is graphed for each class (i), where the were 

(n) E 

estimated by simulation. Indeed, the agreement between ip\ h and the estimated values 
Z { is quite good for the large range of (3 considered, and improves as n increases. The 
results clearly indicate that agreement between E E [Cf™' ) ] and tpf^^ will also be good. 
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Figure S2: Comparing ipf 1 " and simulated values Z i = B] n ' / for exponential growth 
by graphing {^p(^ — Z^j /Vf*^ against i for n and the exponential growth parameter (3 

as shown, where B^ is total length of branches subtending i 6 [n — 1] leaves, and B^ is 

the total length. Values for were obtained from 10 6 replicates, and y?-™'*' 2 was computed 
exactly. 
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Figure S3: Graphs of 1/E [-B^], the estimated value of 9/2 per observed mutation when 
using the Watterson estimator (3) as a function of a (A), compare with (3); and the estimated 
value of \i per observed mutation (B), using (4) together with (3), and assuming the timescale 
Cat = N l ~ a . The number of leaves n are as shown. In B, time is converted into generations 
by multiplying E B [B^] with N a ~\ when N = 10 5 . One obtains 1/E K [B^] = 0.177, 
0.097, and 0.067 coalescent units for n = 10, 100, and 1000, resp. 
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Comparison of E E [^ (n) ] and E B [^ n) ] 

Exact computations of expected branch lengths E E [_B^] associated with exponential growth 
become inefficient as sample size increases, due to numerical computation of (S8). One can, 
however, estimate E E [i^] using simulations, applying results from Griffiths and Tavare 
(1998). Figure S4 shows the distance between E E [^ n ' 1 ] and E B [^ n ' ) ] as measued by the X 2 
norm (S12), and the distance between the normalized spectrum y?- n ^ and ip\ n ^ as measured 
by the Y 2 norm (S13). Figure S4 corresponds to certain 'one-dimensional slices' through a 
two-dimensional graph that would be the analogue of Figure 5 for n = 1000. 
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Figure S4: Graphs of X 2 (S12) and Y 2 (S13) as a function of a when the 'data' is E E 



resp. (pf 1 ^ (+ for E K resp. <£>- n ' ) ' K ) 



compared to E 



B 



(n) 



resp. <£>f^' for /3 (= b) and sample 



size n as shown. Estimates for E E were obtained from 10 5 replicates. 
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Mean misclassification probabilities and posterior probabilities for 
ABC approach - alternative parameter choices 



Table SI: Approximations of mean posterior probabilities and misclassification probabilities 
for the ABC model comparison for different growth parameter ranges or tolerance rates. 
The nSFS is used as summary statistics. /3 max denotes the maximal growth rate used in the 



growth model, n cv denotes the number of cross-validations; 
classes are lumped into one class. An expected number s = 



'lump' denotes which mutation 
300 of mutations are assumed. 
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Table S2: Approximations of mean posterior probabilities and misclassification probabilities 
for the ABC model comparison (n = 200) for alternative tolerance x = 0.0025 and assumed 
expected number s = 60 of mutations. The nSFS is used as summary statistics. n cv denotes 
the number of cross-validations 'lumped' denotes which mutation classes are lumped into 
one class. The maximal growth rate used in all model comparisons is /3 max = 1000. 
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Table S3: Approximations of mean posterior probabilities and misclassification probabilities 
for the ABC model comparison (n = 200) for alternative tolerance x = 0.001, assumed 
expected number s = 60 of mutations and alternative prior ranges and distributions. The 
nSFS is used as summary statistics. n cv denotes the number of cross-validations 'lumped' 
denotes which mutation classes are lumped into one class. For growth rate 0, the prior is 
uniformly distributed on {/3 m i n , f3 m \ n + 10, . . . , /3 ma x}- For coalescent parameter a, the prior 
is uniformly distributed on [a m i n , a max ] 
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Difference in timescales 

Coalescent models are usually obtained as approximations of real populations with finite, 
and often very different, population size N. The difference in time-scale can thus be quite 
significant. This important aspect of coalescent models is subtly hidden in the actual re- 
sulting limiting models. Indeed, the mutation rate /i at the locus under consideration per 
individual per generation, as was noted by Eldon and Wakeley (2006), must be scaled 
in proportion to 1/c/v, where is the probability that two gene copies, drawn uniformly 
at random and without replacement from a population of size N, derive from a common 
parental gene copy in the previous generation. This follows from the famous limit theorem 
for Cannings models of Mohle and Sagitov (2001). In the usual Wright-Fisher haploid 
model, = 1/N and 6 is proportional to fi ■ N. In a general Cannings model, 

cn = Kt^zM (S10 ) 

where V\ is the random number of offspring of individual one (arbitrarily labelled) in any 
given generation. Schweinsberg (2003) gives a timescale associated with the Beta(2 — 
a, a)-coalescent, where 1/cjv is proportional to iV a_1 , 1 < a < 2. Thus, 9 is proportional to 
jj, ■ iV a_1 when associated with the Beta(2 — a, a)-coalescent, as was also noted by Birkner 
and Blath (2008). As emphasized by Eldon and Wakeley (2006), this discrepancy in 
timescales between different coalescent models complicates comparisons of predictions of 
genetic diversity when only 6 is used for mutation rate, since 6 associated with the usual 
Kingman coalescent is often not the same 6 as the one associated with a multiple merger 
coalescent. 

Care must also be taken with models of fluctuating population size. In Kaj and Krone 
(2003), a time-changed n-coalescent under a general model of variable population size is 
derived. More precisely, the authors consider a haploid Wright-Fisher model with population 
size N at generation r = 0 and consider a population size process M^ir)^ e Z of the form 

M N (r) = NX N (r), r G Z, 
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that is, X N (r) describes the 'relative population size' at generation r. Under the assumption 
that X N ([Nt\),t G R converges to something non-degenerate (i=..e. bounded away from 0 
and oo), they get the well-known limiting result that a time-changed Kingman coalescent 
describes the genealogy, where the infinitesimal coalescence rates are given by 1/V(s), with 



Our previously discussed exponential growth model corresponds to a growth rate of (3/N per 
generation in the pre-limiting model, and indeed we have 



Thus, the size Nt generations ago is approximately iVe - ^*. In practice, one needs to choose 
a reference population size (say at present time 0, from data), and then choose a growth 
function ^(s), normed in a way that it equals 1 at the current time 0 (one could also take a 
reference population size at some other time-point, and then rescale the population size to 
be 1 at that time-point for the computation of the coalescence rates). The chosen reference 
population size, say No, should then be used to convert time into generations. For example, 
if Sj v is the time, measured in coalescent units from the present, at which two out of 
currently j ancestral lineages coalesce, started from n leaves, 



is the corresponding time in generations at which the two ancestral lineages coalesce. 
Minimum-distance statistics 

To estimate power under point hypotheses we employed standard £ p minimum-distances 
measures. Minimal distance estimators are a popular tool to solve the problem of fitting 
data with a reasonable, though simplified model that cannot be absolutely exact. They 



is(s) = lim X N ([Ns\). 



(SH) 
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satisfy useful optimality properties (Millar, 1984) and are sometimes preferred over a 
maximum-likelihood approach (see e.g. Berkson (1980) for a discussion). 
The minimum-distance statistics X p and Y p we consider are defined by 



X p — 



'n-l 

Els 



i/p 



(n) 



E 



and 



,i=i 



'n-l 



(S12) 



Y, 



(n) 



(n),n 



,1=1 



(S13) 



where E n 



(n) 



resp. y?- n ' ) ' n i s obtained under the null hypothesis (II). Figure S4 shows 



graphs of X 2 and Y 2 as functions of a when E* 



(n) 



(and E K 



(n) 



is compared to E 



B 



(n) 



(Figure S4, left), or <^- n ^ compared to <p^' (and <£>- n ^ ,K ; Figure S4, right) for different values 
of the exponential growth parameter f3 and sample size n as shown. 

To account for the difference in variance of resp. one would standardise the 



deviations (£ 



E r 



(«) 



resp 



■(c 



- 1 — V 9 !™' 1 ' 11 ) by the standard deviation of ^ resp. 



(l n \ However, one can only compute the variance 



(n) 



by a recursion, which is 



0(n 5 ) (Birkner et al, 2013b), and thus only works for quite small n. In addition, we do 
not have a way of computing Var 11 (Ci n j ■ 

Distributional convergence results, with knowledge of convergence speed, would allow the 
application of the Millar (1984) machinery, leading to confidence bounds on the minimal 
distance estimators. Unfortunately, such results are elusive in most cases, and where known, 
indicate very slow convergence speed. 



Power computations 

The power of a statistical test is the probability of rejecting a false null hypothesis. By the 
Neyman-Pearson lemma, a likelihood-ratio test has the highest power of all tests of the same 
size. However, since the distribution of the site-frequency spectrum is unknown, we consider 
the minimum-distance statistics X p (S12) and Y p (S13). 
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To estimate the power, say, when the null hypothesis is of exponential growth (E) with 
growth parameter (3, and the alternative of a Beta(2 — a, a)-coalescent (B) with parameter 



a, the expected values, E t 



(nj 



were estimated under population growth using simulations, 
making use of the results of Griffiths and Tavare (1998). The quantiles q x of the statistic 
(X p or Y p ) corresponding to a given size 1 — x G {0.01,0.05,0.1} of the test were then 
estimated from 10 5 site-frequency spectra simulated under population growth. Since the 
null hypothesis is assumed to be wrong, site-frequency spectra are now simulated under a 
Beta(2 — a, a)-coalescent with a given parameter a, and the power estimated as the fraction 
of times the computed statistic equalled or exceeded the estimated quantile q x . In notation, 
let £( n )' B denote the site-frequency spectrum associated with the Beta(2 — a, a)-coalescent 
resp. exponential population growth (£^' E ), ie. when is simulated under the model 
indicated. We estimate the quantiles q x of the statistic 



X 



E,E 



Ek 



(n),E 



1/p 



(») 



Jn),E 



by simulating values of Cj' 1 ''^- Define the statistic X^' E by 



X 



B,E 



^| e (n)3_ E E 



0 



(n) 



1/p 



which refers to population growth being the null hypothesis, and the Beta(2— a, a)-coalescent 
being the alternative. The power P ( Xp A '^ > q x j of the statistic X^ A ' ^ is estimated by 



P (^ A ' E) > q> 



6=1 



X 



d,A,E) 



>9a 



where Xp b,A '^ denotes the value of the statistic Xp A '^ computed for the 6-th replicate and 
R the number of replicates. In a similar way, of course, one can estimate power when the 
null hypothesis is of a Beta(2 — a, a)-coalescent with alternative being exponential growth. 
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Figure S5: Estimation of P (Y 2 > q x ) - the power for the Y 2 norm (S13) for 
x G {0.9,0.95,0.99} as shown in the legend - as a function of sample size n G 
{10, 25, 50, 75, 100, 150, 200, 250, 300, 500, 750, 1000}. The alternative is exponential growth 
with f3 as shown, and the null of a Beta-coalescent with a G {5/4, 3/2, 7/4} and total number 
of segregating sites s as shown, with mutation rate \i per generation estimated from s each 
time. Expected values, quantiles, and power estimates based on 10 5 iterations each. 
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Figure S6: Estimation of P yX^ > q x ) - the power for the X 2 norm (S12) exclud- 
ing the singletons for x as shown in the legend - as a function of sample size n G 
{10, 25, 50, 75, 100, 150, 200, 250, 300, 500, 750, 1000} and time computed in generations with 
current population size N 0 = 10 4 . The alternative hypothesis is of an exponential growth 
with (3 as shown, and the null of a Beta-coalescent with a and total number of segregating 
sites s as shown, with \i estimated from s each time. Expected values, quantiles, and power 
estimated based on 10 5 iterations each. 
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Figure S7: Estimation of P \Y£ > q x j - the power for the Y 2 norm (S13) exclud- 
ing the singletons for x as shown in the legend - as a function of sample size n G 
{10, 25, 50, 75, 100, 150, 200, 250, 300, 500, 750, 1000} and time computed in generations with 
current population size N 0 = 10 4 . The alternative hypothesis is of an exponential growth 
with (3 as shown, and the null of a Beta-coalescent with a and total number of segregating 
sites s as shown, with \i estimated from s each time. Expected values, quantiles, and power 
estimated based on 10 5 iterations each. 
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Deriving the expected SFS under variable population size 

The computations outlined below for the reader's convenience are essentially contained in 
POLANSKI et al. (2003). 

For notational simplicity, we drop the index n for the sample size and the index IT resp. v for 
the underlying coalescent model resp. the demographic history. Consider a time-inhomogeneous 
n-coalescent where each pair of lines present at time t merges at rate v(t), and recall that T/%, 
k = n, n — 1, . . . , 2 denote the length of the time interval while there are exactly k lineages, and let 
Sj := T n + T n _i + • • • + Tj, j = n, . . . , 2, the time point when the number of lines jumps from j to 
j — 1, with and S n+ i := 0. Then, from (Griffiths and Tavare, 1998, (2.4)), we have 

P(T k e (t,t + dt)\S k+1 = s) = (^\u(s + t)exp(-(^\ J* + v(u)dv) dt (S14) 

so the joint density of T m , . . . , T n (2 < m < n) is 

n(*>(^) x -p(-t(*)HM.»4 m 



k= m ^ ' k ^ n v h=m v J J 5 y 

k+l<j<n 



Hence, the joint density of the S m , . . . , S n is given by 

LI uV(sfc) Xex p( ~ Yl U j / v (u)du) 

k=m W V k=m W Js *+i ' 



ns m )e-^ F ^x (%^y- {k - 1)F{Sk \ (S16) 

k=m+l ^ ' 



(0 = s n +i < s n < s n _i < ••• < s m , 2 < m < re), with F(t) := v(u) du where we used that 
(2) = T!j=2U - !) to ex P ress 

£ (2) / = £ E C*- 1 )/ ' 

fc=m ^ ' Js k+l j=2k=jVm Jsk+i 

= / + (j-1) / !/(«)d«. 

The marginal density of at fixed s m can e.g. be found by integrating out over s n < s n _i < • • • < 
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s m+i (< s m ), it can be expressed as a "generalised mixture" of densities 1 , as follows: 

fs m (s m ) = W'm)Qe"^ W (S17) 

j=m 

where the coefficients c|i'"' can be computed (backwards) recursively: Let ch' n ^ = 1, and 

c { ri' n) = -c ( ^\ {2) j = m + l,m + 2,...,n, 2<m<n-l, (S18) 

(2) - (2) 

4 m ' n) = E C S , :! 7JY A !7^ = 1 - E ^ 2<m<n-l, (S19) 

j=m+l I2J I2J j=m+l 

(and c^'"' = 0 for £ < m). An explicit formula is 

„o»_ TT (9 - f ni-m Pi- 1 )'"^^? )(m) .„,„. 

™ ~iL(S)-©~ ( ' m • ( ' 



We check (S17) by induction: /s„(s n ) = (^(a^e - ®*^, 

using the induction hypothesis we 

find from (S14) 



f,S m (Sm) = ft(%)^(") WS ™ hf(%+l)) X /5 m+1 ( Sm+ l)^m + l 

"W"^- 1 E «&l(a) p( s ^) ex p(-((2)-(2))^(^ + i))d Sm+ i 

i— \ / JO 




1 (S17) is a generalisation of the well known fact that the density of a convolution of exponentials with 
different rates can be written as a linear combination of exponential densities (set v{-) = 1 and thus F(t) = t 
in (S17)): X±, . . . ,Xf- indep., Xi ~ Exp(A^) (and the Ai pairwise different, say Ai > A2 > • • ■ > At), then 
the density of X\ + ■ ■ ■ + Xf. is a>i^i e ~ XiX with a, = Tlj^i ^3 1 (A? ~ ^t) which can also be easily checked 

by considering the characteristic functions. 
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noting that 

d 



a e 



ds 

for a£R. For the second equality in (S19) note (either inductively or from the fact that (S17) must 
be a probability density) that 

n 

c<£ n) = 1 for 2 < m < n. 

j=m 

The explicit form (S3) can be checked by verifying inductively that the expression on the right 
solves (S18)-(S19) (with the interpretation that the empty product = 1). While checking (S18) is 
straightforward, for checking (S19) one observes that for pairwise different Ai, . . . , A& ^ 0 and any 
z G C we have 

n^E^n^V put to see that J] D ^ = ' 

£=l c j = l J l<i<k 1 J j=l l<i<k 1 J 

(by partial fraction expansion, observe that both sides of the left equation are meromorphic functions 
of z with k simple poles at Ai, . . . , A& whose Laurent coements agree). 
To check the second equality in (S3) write for fixed j 6 {m, . . . , n} 

n G) TT »(* - 1) 

(i\ _ (j\ 11 



^<^<n \2) \2) m<i<n y J ' K J ' 



i fin * -> no i — 1—L.l ' 



( iy - m 2./- I »' (»-D! (///+./ -2)! i _ 

(-1) 



j(jf — 1) (m — 1)! (m — 2)! (n + j — 1)! (j — m)\ (n — j)\ 
,_ m (2j-l)m n! (j + m-2)! j! j!(n-l)! 
J'O'- 1 ) J ! ( ra -J') ! (m-2)!j! m!(j - m)\ (n + j - 1)! 

>n\ (j+m-2\ ( j - 



j(j-i) (»•; ') 
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The tail of S m is given by (assuming -F(oo) = oo so that all S m are a.s. finite) 

j=m 



Recalling 

aj = r°e-® F Vd8, 
Jo 

(assuming that F grows sufficiently fast at oo) then gives the desired result: 



poo " 

E[5 m ] = / F(S m >s )ds=y j c, 
Jo 



m U J i 



(S22) 



and 

n 

E[T k ]=E[S k -S k+1 ] = c^a k + £ (cj>> - cft^a, . 

For the case of exponential growth, i.e. v(t) = and 

F(t) = v{u) du = /T 1 (e pt - l) , 
we get for c > 0 

poo poo / \ 1 f°° flu 1 

/ e~ cF ^ ds = e c ^ / exp ( - %eP' ) = ^e c ^ / e""— = -e^ifcAS) (S23) 
Jo Jo v p y P Jc/p u p 

where we substituted %e@ s = u and 

Ei(t)= / dx= dx 

Jt x Ji x 

is an exponential integral function (e.g. (Abramowitz and Stegun, 1964, 5.1.1)). In particular, 
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we get 

1 



, -exp r 1 © )E 1 r 1 ® - (S24) 



For the case of algebraic growth consider u(t) = t 1 for some 7 > 0. Then 



F(t)= [ s~< ds = ^—t~< +1 
Jo 7 + 1 

and for c > 0, 

roc noc roo 

/ e -<*(>) ds = / e-cfT+i)- 1 -^ 1 ds = c -l/(7+i) 7 -7/(7+i) / e -« u -7/(7+i) du 
JO io ■/ 0 

= c -l/(7+l) 7 -7/(7+l) r (l/( 7 + 1)) (S25) 

where we substituted u = ct 7+1 /(7+l), hence du/dt = ct 1 = c 1 /( 7+1 )7 7// ( 7+1 )u 7// ( 7+1 ). In particular, 
we obtain 

r(i/(7 + i)) /0V 1/(7+1) rq9fi , 

7 7/(7 + d y • (s26) 



ABC analysis of the cytochrome 6 mtDNA data of AflNASON (2004) 

To investigate which model class fits better to the data, we use the ABC model comparison 
approach given the (lumped) nfSFS of the observed mitochondrial locus. The growth model 
class is specified by an uniform prior on {0, 10, 20, ... , 1000} for the class of growth models 
and the class of Beta n-coalescents by an uniform prior on {1, 1.05, . . . , 2}. Due to the 
numeric difficulties of evaluating the exact expected tree length for the used models, we 
approximate E n [B^~\ for the prior mutation rate gn ^(n)j corresponding to a chosen a 
or (5 by the mean value from 10,000 simulations. We use a tolerance level of 0.005 and 
perform n reps = 200, 000 simulations for each model class. See Table S4 for the approximated 
Bayes factors g E ^ B for the model comparison of the growth model and the Beta n-coalescent 
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model using different lumps of the nfSFS as summary statistics. The observed data fits 
slightly better to the growth model, but not so much better that we could discard the 
Beta n-coalescents as possible genealogy models for this locus. Jeffreys (1961) suggested 

Table S4: Approximated Bayes factors given the Atlantic cod mtDNA data 
lumping number 



g E / B 



10+ 50+ 100+ 200+ no 
6.435 1.74 2.378 3.264 3.175 



interpreting Bayes factors according to the log 10 scale. Lumping at 10 (Table S4) then gives 
'substantial' (1/2 < log 10 (^ E / B ) < 1) evidence against the Beta(2 — a, a)-coalescent in favor 
of exponential growth. Using KASS and Raftery (1995) suggestion of considering Bayes 
factors on 21og e scale gives 'positive' (2 < 21og e (£> E / B ) < 6) evidence in favor of exponential 
growth, based on lumping at 10. 

Additionally to the ABC model comparison, we also evaluate which parameters fit best 
to the observed nfSFS at the mitochondrial locus. For each model class used, we record the 



prior parameters from the 0.5% of the n reps = 200,000 simulations that have the smallest 



distance to the observed nfSFS (summary statistics). This gives an approximate sample of 
the posterior distribution of ir(a\ observed (^) resp. 7r(/3| observed C^)- Analogously, we 
also used the lumped nSFS as summary statistics. Figure S8 shows the posterior distributions 
for different lumping numbers. 
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Figure S8: Approximate ABC sample from ABC fitting of the A growth and B Beta n- 
coalescent model classes to the observed nfSFS in the Atlantic cod data. Denote by a the 
Beta n-coalescent parameter, j3 the growth rate. Priors were uniform on both sets. 
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ABC quality control for the Arnason (2004) data 

We follow the recommendation from the R package abc (Csillery et at, 2012) and perform 
three checks of quality for the presented ABC approach. We focus on the lumping which 
gives the clearest distinction, namely the lumping of all classes with mutation counts 10 or 
higher (class 10+). All checks are performed using the R package abc 

To assess the general ability to distinguish between the two model classes in the setting (i.e., 
number of observed mutations and sample size) given by the Atlantic cod mtDNA data from 
Arnason (2004), we again employ a leave-one-out cross-validation as described in Methods. 
See Table S5 for the results. 

Table S5: Mean posterior probabilities and misclassification probabilities for the ABC model 
comparison for tolerance x = 0.005 and n = 1278. We use the number s = 39 of observed 
mutations to estimate the mutation rate via Watterson's estimator and use the nfSFS (10+ 
lumped) as summary statistics. We use n cv = 12, 000 cross validations. 

E B [7r(E|Q] %Kb|Q] E b [vr( g E / B > 1|Q] E e [vr(g E / B < 1|Q] 
0.283 0.238 0.232 0.13 

To assess the quality to distinguish the parameters within one model class, we again use 
leave-one-out cross-validations (n cv = 12,000). The parameter of each simulation chosen 
for cross-validation is estimated as the median of the 0.5% of simulations with the smallest 
£ 2 distance to the chosen simulation. Figure S9 shows the resulting scatter plots of the 
parameters of the chosen simulations and the corresponding estimations. 

Figure S9: Scatter plots of estimated vs. true parameters of n cv = 12000 cross-validated 
simulations in the A Beta coalescent model class B growth model class 
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1.0 1.2 1.4 1.6 1.8 2.0 0 200 400 600 800 1000 



True a True p 



Downloaded from http://biorxiv.org/on September 18, 2014 



To see whether the posterior distributions given the cod mtDNA data from Arnason 
(2004) define models under which the observed data is reproducible, we perfomed posterior 
predictive checks by simulating the 10+ lumped nfSFS under the posterior distribution (i.e., 
simulating once from each parameter set of each of the 1,000 accepted simulations) for each 
model class and compare these with the nfSFS observed. See Figure S10 for the results. 

Figure S10: Posterior predictive checks with 1,000 simulations of the nfSFS under the ap- 
proximate posterior distributions given the cod data from Arnason (2004) for the A Beta 
coalescent model class B growth model class. Asterisks denote the observed values in the 
data. 
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The quality checks reveal that we can't distinguish well within each model class, but 
moderately between model classes. The posterior predictive checks reveal that both model 
classes can produce the observed values, thus including possible (though not necessarily 
well-fitting) models for the data at hand. 
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Estimation of power for n = 100 



Figure Sll: Estimate of power as a function of log 10 j3 for (3 G {0, 1, 2, . . . , 9, 10, 20, ... , 1000} 
when the Beta(2 — a, a)-coalescentis the null hypothesis, and the test statistic is 
sup{£(tf,«e (n) ),?? e 6 0 } - sup{£(0,fW),i? E 0i } (5) , with ^(^,^ (n) ) the log of the Pois- 
son likelihood function (9) (no lumping). Values at log 10 /3 = —1 correspond to the Kingman 
coalescent (ft = 0). In A, 10 5 replicates; in B, 10 6 replicates. 
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Figure S12: Estimate of power as a function of a for a G [1,2] when the Beta(2 — 
a, a)-coalescentis the null hypothesis, and the test statistic is sup{£(i9, $ G 6 0 } — 
sup{£(>?,£ (n) ),$ G 0i} (5) , with e($,£ (n) ) the log of the Poisson likelihood function (9) (no 
lumping). Values at a = 2 correspond to the Kingman coalescent; 10 6 replicates for quantiles 
and power estimates. 
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